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1.  Introduction 


This  document  describes  a  series  of  kernel  benchmarks  for  the  PCA  program.  Each  kernel  bench¬ 
mark  is  an  operation  of  importance  to  DoD  sensor  applications  making  use  of  a  PCA  architecture. 
Many  of  these  operations  are  a  part  of  the  composite  example  application  described  elsewhere  [II]. 

The  kernel-level  benchmarks  have  been  chosen  to  stress  both  computation  and  communication 
aspects  of  the  architecture.  “Computation”  aspects  include  floating-point  and  integer  performance, 
as  well  as  the  memory  hierarchy,  while  the  “communication”  aspects  include  the  network,  the 
memory  hierarchy,  and  the  I/O  capabilities.  The  particular  benchmarks  chosen  are  based  on  the 
frequency  of  their  use  in  current  and  future  applications.  They  are  drawn  from  the  areas  of  signal 
processing,  communication,  and  information  and  knowledge  processing.  The  specification  of  the 
benchmarks  in  this  document  is  meant  to  be  high-level  and  largely  independent  of  the  implemen¬ 
tation. 


2.  Metrics 


For  each  benchmark,  a  set  of  problem  sizes  are  defined.  Throughout  this  section,  we  refer  to 
the  kernel  by  the  index  k,  and  refer  to  particular  data  sets  for  a  given  kernel  as  d„  where  i,  = 

1.2 . iV/,.,  and  JV/,.  varies  from  kernel  to  kernel.  We  assume  that  the  data  for  the  problem  begins 

in  a  “staging  area”  accessible  to  the  PCA  computation  units  (“main  memory”  or  “an  I/O  stream”) 
and  must  be  moved  into  local  memory.  For  the  initial  realization  of  the  benchmarks,  the  staging 
area  can  be  main  memory.  Final  measurements  in  the  next  phase  of  the  PCA  program  should  have 
an  I/O  stream  as  the  staging  area. 

There  are  two  major  metrics  of  interest  for  each  problem  size.  The  first  is  the  total  time  or 
latency,  L{{k.  dj),  to  perform  kernel  k  for  a  data  set  size  dj  using  a  single  PCA  prototype  integrated 
circuit  (hereafter,  a  PCA  chip).  This  measurement  should  include  both  computation  time  and  the 
time  to  move  the  data  for  the  problem  from  the  staging  area  (off  the  PCA  chip)  to  a  computation 
or  operation  area  (on  the  PCA  chip). 

The  second  major  metric  of  interest  is  the  sustained  achievable  throughput,  T(k.dj).  For  each 
kernel  k  and  problem  size  d,,  a  measure  of  the  workload ,  W(k.d,),  is  defined  in  an  operation- 
dependent  and  system-independent  way.  (For  floating-point  computation  operations,  IF  is  the 
floating-point  operation  count,  while  for  communication  operations,  W  is  the  number  of  bytes 
transferred.)  The  sustained  achievable  throughput  is 


T(k.di) 


nW{k ,  dj) 


n—oc  Ln(k\  dj) 


(I) 


where  Ln(k\d,)  is  the  total  time  to  solve  n  problems  of  the  given  type  using  the  PCA  chip.  As 
above,  L„(k,  d,)  includes  the  time  to  move  the  data  from  the  staging  area  to  an  operation  area. 

There  are  clear  trade-offs  between  throughput  and  latency.  If  the  entire  PCA  chip  is  being  used 
to  solve  kernel  k  for  data  set  dj,  then  Ln  =  nL\  and  T  —  W[L\.  In  some  cases,  however,  an 
operation  will  be  able  to  take  advantage  of  pipelining  and  petform  multiple  computations  of  the 
same  type  at  the  same  time,  resulting  in  higher  throughput.  Obviously,  the  extent  to  which  this  can 
be  accomplished  will  depend  on  the  input  bandwidth  of  the  PCA  chip.  To  measure  the  throughput 
for  our  purposes,  it  is  sufficient  to  measure  Ln  for  a  value  of  v  that  is  sufficiently  large  (at  least 
7i  >  10,  and  preferably  n  >  100). 

For  embedded  systems  we  are  interested  in  the  efficiency  of  the  operation,  that  is,  the  use  of 
resources  relative  to  the  potential  of  the  device.  In  general  the  efficiency  E{k ,  d, )  is  defined  as 


Eik.dj) 


T(k\  dj) 
U(k) 


(2) 


where  U(k)  is  the  kernel-dependent  upper  bound  or  peak  performance  of  the  chip.  The  definition 
of  U{k)  is  linked  to  the  definition  of  the  workload.  When  W  is  in  floating-point  operations,  U(k) 
is  the  theoretical  peak  floating-point  computation  rate  (based  on  the  clock  rate  and  the  number  of 
floating-point  units).  For  a  communication  operation,  where  workload  is  defined  in  bytes,  U{k)  is 
the  theoretical  peak  bandwidth  between  the  communicating  units.  For  benchmarks  other  than  the 
signal  processing  and  communication  benchmarks,  efficiency  is  difficult  to  calculate  because  peak 
performance  for  the  corresponding  workloads  cannot  easily  be  defined.  For  example,  the  workload 
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for  the  database  benchmark  is  in  transactions,  and  there  does  not  exist  an  easily  calculable  peak 
performance  for  the  number  of  transactions  performed.  In  these  situations,  efficiency  cannot  be 
calculated. 

One  of  the  key  metrics  for  the  PCA  program  is  stability  of  the  performance.  Kuck  [9,  p.  1 68ffJ 
defines  stability  as  the  ratio  of  the  minimum  achieved  performance  to  the  maximum  achieved 
performance  over  a  set  of  data  set  sizes  and  programs.  Stability  is  defined  in  two  senses  for  the 
kernel  benchmarks,  a  per-kernel  sense  and  an  overall  sense.  The  per-kernel  stability  is  reflected  by 
a  metric  called  data  set  stability,  S(i,  defined  as  the  stability  for  a  particular  kernel  over  all  data  sets 
for  that  kernel. 


_  miiirf,  T(k.d,) 

max;rf(  T(k.  d> ) 


(3) 


Stability  across  all  kernels  poses  a  problem,  as  the  workloads  and  thus  the  throughput  calcula¬ 
tions  are  different  for  different  kernels.  However,  a  good  indication  of  the  overall  stability  can  be 
gleaned  from  the  geometric  mean  of  the  kernel  stabilities, 


Sd  = 


(4) 


Finally,  for  embedded  systems,  an  important  metric  is  the  achieved  performance  per  unit  power 
consumed  by  the  PCA  chip. 


C(k.d,)  = 


TjLdy 

p(k.d,y 


(5) 


where  P(k,dj )  is  the  overall  power  consumed  during  the  operation.  This  normalized  quantity  C 
gives  some  indication  of  the  “cost”  of  executing  the  benchmark  on  the  given  PCA  chip.  Obviously, 
this  metric  ignores  power  consumed  by  other  elements  of  the  system,  but  allows  comparison  with 
commercially  available  processors  using  the  same  metric.  Performance  metrics  per  unit  size  and 
weight  are  omitted,  as  the  processing  unit  is  perceived  to  be  less  of  a  driver  for  either  of  these 
quantities  than  for  power  consumed  by  the  system. 

Along  with  the  specified  measurements,  developers  are  asked  to  provide  their  implementa¬ 
tion  of  the  algorithm  and  a  description  of  chip  resource  usage.  The  specific  implementation  of 
the  benchmark  used  to  achieve  the  measured  latency  and  throughput  is  of  great  interest  for  several 
reasons.  One  important  reason  is  to  compare  the  throughput  and  latency  achievable  by  different  ap¬ 
proaches.  For  this  comparison,  the  availability  of  multiple  implementations  of  the  same  algorithm 
on  the  same  architecture  using  different  approaches  would  be  ideal. 

Another  important  reason  to  examine  the  implementation  has  to  do  with  the  workload,  \  Y(k.d,), 
which  is  defined  for  a  particular  implementation  of  the  kernel.  This  standard  workload  and  imple¬ 
mentation  allows  comparison  of  different  architectures.  If  an  additional  algorithm  with  a  signifi¬ 
cantly  different  workload  is  also  implemented,  the  value  of  W  must  be  adjusted  or  the  workload 
is  invalid.  Therefore,  developers  are  asked  to  provide  an  estimate  of  the  workload  for  implementa¬ 
tions  that  are  substantially  different  from  those  described  in  this  report. 

In  summary,  developers  are  requested  to  measure  the  latency,  throughput,  and  power  consumed 
for  each  kernel  benchmark  k  and  data  set  d,.  The  theoretical  peak  floating-point,  operation,  and 
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communication  rates  should  be  reported  for  the  chip.1  All  other  metrics  (efficiency,  stability  over 
problem  size,  stability  over  all  kernels,  and  performance  per  unit  power)  are  derived  from  chip 
parameters  and  the  measured  quantities.  Other  statistics  such  as  variance  may  be  appropriate  also 
and  may  be  calculated  from  these  results.  The  desired  quantities  are  summarized  in  Table  1 . 


Table  1 . 

Benchmark  metrics. 


Parameter 

Description 

Calculation 

Lj(k.  dj) 

Total  latency  for  j  problems 

measured 

W(k.di) 

Workload 

given 

T{k.di) 

Throughput 

lim  ""TOT)- 

Ullln-^ac  L„{k.d,) 

U(k) 

Performance  upper  bound  for  operation  type: 

floating-point 

clock  rate  *  floating-point  units 

communication 

bandwidth  between 
communicating  units 

integer 

clock  rate  *  integer  units 

EM) 

Efficiency 

wnr,  r~ 

t'tfc) 

Sd(k) 

Stability  over  data  sets 

min4ii  T(k,dt) 

inax^  T{k.dt) 

sd 

Mean  of  data  set  stabilities 

v/nL,  s„(k) 

PM) 

Power  consumed 

measured 

C(k.di) 

Performance-power  efficiency 

T(M.) 

P(k.di) 

'These  rales  should  be  specific  to  an  agreed-upon  technology  as  discussed  ai  the  Morphware  forum  meeting  in  July 
2002. 
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3.  Signal  Processing  Benchmarks 


Four  signal  processing  kernels  are  included  in  the  benchmark  set.  They  present  a  range  of  a  differ¬ 
ent  characteristics  in  terms  of  operation  counts  and  memory  references. 

3.1  Finite  Impulse  Response  Filter  Bank 

The  finite  impulse  response  (FIR)  filter  is  one  of  the  basic  operations  in  signal  processing.  This 
kernel  implements  a  set  of  M  FIR  filters.  Each  FIR  filter  m,  m  E  {0, 1, ... ,  M  —  1},  has  a  set  of 
impulse  response  coefficients  ium[A~],  k  E  {0  ...  K  —  1}.  If  the  length  of  the  input  vector  is  N,  the 
output  of  filter  m,  ym,  is  the  convolution  of  wm  with  the  input  x,„: 

K-  1 

tjm [']  =  ]T^  -  k\wm [A'],  for  ?'  =  0. 1, . . .  N  -  1. 

k=0 

The  filter  is  often  implemented  using  fast  convolution  with  the  fast  Fourier  transform  (FFT). 
The  most  efficient  implementation  depends  on  various  factors  including  the  size  of  the  filter  re¬ 
sponse  vector  (for  more  details,  see  [12]).  We  define  two  data  sets  in  Table  2,  one  for  a  short 
filter  and  one  for  a  long  filter.  We  define  the  workload  for  this  kernel  as  the  minimum  workload 
of  the  time-domain  and  frequency-domain  implementations.  For  the  long  filter,  the  frequency- 
domain  implementation  operation  count  (i\/(10iV  log2  N  +  8 N))  is  given.  For  the  short  filter,  the 
time-domain  implementation  operation  count  (8 MX K)  is  given. 


Table  2. 

FIR  filter  bank  input  parameters. 


Parameter 

Name 

Description 

Values 

Set  1 

Set  2 

AJ 

Number  of  filters 

64 

20 

N 

Length  of  input  and  output  vectors 

4096 

1024 

K 

Number  of  filter  coefficients 

128 

12 

W 

Workload  (Mflop) 

34 

1.97 

All  operation  counts  for  the  FIR  filter  assume  complex  input  and  output  data.  In  addition, 
the  frequency-domain  operation  count  assumes  that  an  FFT  and  inverse  FFT  are  implemented 
using  a  radix-2  algorithm.  Many  other  implementations  of  the  FFT  are  possible:  see  Van  Loan 
for  a  discussion  of  algorithms  for  performing  the  FFT  [16].  Because  such  a  wide  variety  of  FFT 
implementations  exist,  and  the  particular  algorithm  implemented  in  a  given  library  may  not  be 
known  or  may  change  with  data  size,  it  is  common  to  use  the  radix-2  operation  count  to  calculate 
throughput  even  when  a  different  algorithm  may  be  in  use. 
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3.2  QR  Factorization 


The  QR  factorization  is  a  linear  algebra  operation  that  factors  a  matrix  into  an  orthogonal 
component,  which  is  a  basis  for  the  row  space  of  the  matrix,  and  a  triangular  component.  In 
adaptive  signal  processing,  the  QR  is  often  used  in  conjunction  with  a  triangular  solver. 

The  QR  factorization  of  an  rri  x  n.  matrix  A  with  m  >  n  is  a  pair  of  matrices  A  =  QR,  where 
the  unitary  matrix  Q  is  of  size  m  x  m  and  the  upper-triangular  matrix  R  is  of  size  m  x  n.  There 
are  many  ways  of  calculating  the  QR  factorization,  as  discussed  in  Golub  and  Van  Loan,  including 
the  Householder  transformation  method  [7,  Algorithm  5.2.1],  the  Modified  Gram-Schmidt  algo¬ 
rithm  [7,  Algorithm  5.2.5],  and  the  Fast  Givens  method  [7,  Algorithm  5.2.4].  Of  these,  we  chose 
to  specify  the  Fast  Givens  method  for  this  kernel  benchmark.  This  was  primarily  done  because  the 
Fast  Givens  method  consists  of  a  number  of  fine  grain  calculations.  This  structure  is  very  suitable 
for  implementation  as  a  stream  algorithm  on  PCAs  such  as  the  MIT  RAW  machine.  For  more  de¬ 
tails,  see  Hoffmann  [8].  The  data  matrix  sizes  that  we  define  for  this  kernel,  and  the  corresponding 
workloads  for  calculating  the  QR  factorization,  are  given  in  Table  3. 


Table  3. 

QR  input  parameters. 


Parameter 

Name 

Description 

Values 

Set  1 

Set  2 

Set  3 

m 

Matrix  rows 

500 

180 

150 

71 

Matrix  columns 

100 

60 

150 

\v 

Workload  (Mflop) 

397 

30.5 

45.0 

3.3  Singular  Value  Decomposition 

The  singular  value  decomposition  (SVD)  is  of  increasing  importance  in  signal  processing.  It 
is  an  advanced  linear  algebra  operation  that  produces  a  basis  for  the  row  and  column  space  of  the 
matrix  and  an  indication  of  the  rank  of  the  matrix.  In  adaptive  signal  processing,  the  matrix  rank 
and  the  basis  are  useful  for  reducing  the  effects  of  interference. 


Table  4. 

SVD  input  parameters. 


Parameter 

Name 

Description 

Values 

Set  1 

Set  2 

Set  3 

rn 

Matrix  rows 

500 

180 

150 

n 

Matrix  columns 

100 

60 

150 

it;, 

Fixed  workload  (Mflop) 

101 

15 

72 

wr2 

Workload  per  iteration  (Mflop) 

0.24 

0.88 

0.54 
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Given  an  m  x  n  complex  matrix  A,  the  singular  value  decomposition  of  A  is 

A  =  UE  VH.  (6) 

where  U  is  a  unitary  matrix  of  size  rn  x  m,  E  is  an  rn  x  n  matrix  in  which  the  upper  n  x  n  portion 
is  a  diagonal  matrix  with  all  entries  real  and  sorted  in  descending  order,  and  V  is  an  n  x  n  unitary 
matrix.  If  rn  >  n,  then  define 


U  =  [  U a  Ub  ]  , 


where  Un  is  size  rn  x  n,  Ut>  is  size  rn  x  (rn  —  n),  and  Ea  is  of  size  n  x  n.  Then  A  =  U„ Ea V H 
is  called  the  reduced  SVD  of  the  matrix  A.  In  this  context  the  SVD  defined  in  equation  (6)  is 
sometimes  referred  to  as  the  full  SVD  for  contrast.  Notice  that  Ua  is  not  unitary,  but  it  does  have 
orthogonal  columns.  When  rn  <  n,  the  reduced  SVD  can  be  similarly  defined  by  partitioning  V 
instead  of  U. 

For  signal  processing  applications,  we  are  typically  most  interested  in  the  reduced  decompo¬ 
sition,  in  the  matrix  U,  and  in  the  singular  values  (the  values  on  the  diagonal  of  E).  We  provide 
operation  counts  for  the  reduced  decomposition,  assuming  that  all  three  matrices  are  produced. 
The  data  matrix  sizes  of  interest  and  associated  operation  counts  are  given  in  Table  4. 

There  are  three  major  steps  to  the  full  SVD  algorithm,  which  are  described  in  more  detail  in 
Golub  and  Van  Loan  [7],  First,  if  the  m  x  n  matrix  A  has  many  more  rows  than  columns,  a  QR 
factorization  is  performed.  This  step  is  done  if  rn  >  5n/3  [7,  p.  252],  which  is  typically  the  case 
in  signal  processing. 

Define 


Q 

/? 


[  Qa  Qft]. 

'  JRo  ' 

0  ' 


(7) 

(8) 


where  Q„  is  size  rn  x  n,  Q (,  is  size  rn  x  ( rri  —  n ),  and  Ra  is  size  n  x  n .  The  decomposition  A  —  Q„Ra 
is  referred  to  as  the  reduced  QR  decomposition  of  A.  Matrix  Q„  is  not  unitary,  but  it  has  orthogonal 
columns.  The  reduced  QR  factorization  can  be  obtained  by  the  modified  Gram-Schmidt  algorithm 
described  in  Golub  and  Van  Loan  [7,  Algorithm  5.2.5].  If  a  full  SVD  is  being  performed,  the 
full  QR  is  computed:  if  a  reduced  SVD  is  being  performed,  a  reduced  QR  is  computed.  In  the 
remainder  of  this  exposition,  we  describe  the  reduced  SVD  algorithm  for  a  matrix  with  in  >  n. 
We  assume  that  in  the  first  step,  we  perform  a  reduced  QR  decomposition  via  the  MGS  algorithm 
to  produce 

A  =  U,R, 

where  R  is  an  n  x  n  upper-triangular  matrix  and  U\  is  an  rn  x  n  matrix  with  orthogonal  columns. 
(Notice  that  the  QR  factorization  described  in  Section  3.2  is  a  full  QR;  hence,  a  different  algorithm 
is  used  here.) 

In  the  next  step,  R  is  reduced  to  bi-diagonal  form,  to  consist  of  the  main  diagonal  and  a  single 
diagonal  of  entries  above  that,  with  the  remainder  of  the  entries  in  the  matrix  being  zero  [7,  p.  253], 
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This  is  accomplished  with  Householder  transforms,  producing 


R  =  U2BV2h. 


where  U2  and  V2  are  unitary  and  of  size  n  x  n ,  and  the  n  x  n  matrix  D  is  bi-diagonal.  The  matrix 
B  produced  at  this  step  is  no  longer  complex,  but  real,  though  matrices  U2  and  V2  are  complex. 

The  final  step  is  an  iterative  reduction  of  B  to  diagonal  form  and  the  ordering  of  the  singular 
values.  This  is  accomplished  with  Givens  rotations  [7,  p.  454],  At  the  end  of  this  step  we  have 
produced  matrices  n  x  n  orthogonal  matrices  U3  and  V3  such  that 


B  =  U3  EFf . 


so  that  the  singular  value  decomposition  of  the  original  matrix  A  can  be  expressed  as 


A  =  UiU2U:i'Z{V:iV2)fl 
=  UT,VH . 


with  U  =  Ux U2lh  and  V  =  V3V2. 

As  the  exact  number  of  iterations  required  to  produce  the  SVD  is  based  on  the  data,  an  effi¬ 
ciency  measurement  must  take  into  account  the  actual  number  of  iteration  steps  performed.  We 
account  for  this  by  defining  the  workload  IF  as  a  linear  function  of  two  numbers  Il  'i  and  114  given 
in  Table  4, 


IF  =  W\  T  d  *  IF2. 


(9) 


where  d  is  the  number  of  iteration  steps  performed  in  the  reduction  of  B  to  diagonal  form,  114  is  an 
estimate  of  the  number  of  floating-point  operations  per  iteration  step,  and  IF(  is  an  estimate  of  the 
number  of  floating-point  operations  in  the  remainder  of  the  algorithm.  The  estimates  I F(  and  114 
for  complex  matrices  are  given  separately  for  the  full  and  reduced  SVD  algorithms  in  Table  4.  The 
number  <7  for  a  given  data  set  must  be  “discovered”  in  the  course  of  the  execution  of  the  benchmark. 

There  are  many  implementation  details  associated  with  achieving  high  performance  in  the  sin¬ 
gular  value  decomposition.  Examples  of  such  details  include  the  use  of  block  Householder  trans¬ 
forms  [7,  p.  2 1 3]  and  the  storage  of  the  Householder  transforms  and  Givens  rotations  that  produce 
U  and  V  rather  than  the  matrices  themselves  [3]. 

3.4  Constant  False  Alarm  Rate  Detection 

The  constant  false-alarm  rate  (CFAR)  detection  benchmark  is  an  example  of  data-dependent 
processing  designed  to  find  targets  in  an  environment  of  varying  background  noise.  The  benchmark 
subjects  a  subset  of  a  radar  data  cube  to  this  algorithm. 

Assume  a  data  cube  consisting  of  real  (as  opposed  to  complex)  data  whose  dimensions  are 
beams,  range,  and  dopplers.  During  CFAR  detection,  a  local  noise  estimate  is  computed  from  the 
2Nc/nr  range  gates  near  the  cell  C(?,y,  A:)  under  test.  A  number  of  guard  gates  G  immediately  next 
to  the  cell  under  test  will  not  be  included  in  the  local  noise  estimate  (this  number  does  not  affect 
the  throughput).  For  each  cell  C(/,y,  A  ),  the  value  of  the  noise  estimate  T(t.j ,  A  )  is  calculated  as 


GO) 
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The  range  cells  involved  in  calculating  the  noise  estimate  for  a  particular  vector  are  shown  in 
Figure  I .  For  each  cell  k),  the  quantity  \C(i,j,  k)\2/T(i,j ,  k)  is  calculated:  this  represents 

the  normalized  power  in  the  cell  under  test.  If  this  normalized  power  exceeds  a  threshold  //.,  the 
cell  is  considered  to  contain  a  target. 


Cell  Under  Test 

C(i,j,k) 


Figure  I .  Sliding  window  in  CFAR  detection.  The  example  shows  the  number  of  guard 
cells  G  =  1  and  the  number  of  cells  used  in  computing  the  estimate  Nrfar  —  3. 


An  efficient  implementation  of  the  CFAR  algorithm  makes  use  of  the  redundancy  in  the  com¬ 
putation  of  T  according  to  the  formula  given  in  (10).  Note  that  the  relationship  between  T(i.j.k) 
and  T{i.j  +  1,  k)  is 


T(i .  j  +  1.  A’)  —  T(i ,  j,  k)  +  — — — (\C(iJ  +  l  +  G  +  .Vefar.k)\- 

’c/ar 

+  \C{Lj-G.k)\2 

—  \C(i.j  —  G  —  Ncfnr,  k)\2 

—  \C(i.  j  +  G  +  1,  A- )  | 2 ) . 

Using  this  relationship,  the  value  of  T  for  a  particular  set  of  Nrg  range  gates  can  be  calculated  in 
() ( Nrg )  time,  that  is,  independent  of  the  values  of  G  and  Ncfar.  Note  that  some  variations  of  this 
formula  and  equation  (10)  occur  at  the  boundary  areas.  For  the  most  part,  these  are  handled  in  a 
straightforward  fashion:  if  a  computed  index  would  cause  reference  to  a  cell  outside  the  cube’s 
boundaries,  we  ignore  that  term  in  the  computation. 

The  parameter  sets  for  the  CFAR  benchmark  are  shown  in  Table  5. 


Table  5. 


Parameter  sets  for  the  CFAR 

Kernel 

Benchmark. 

Name 

Description 

Set  0 

Set  1 

Set  2 

Set  3 

Units 

Nbm 

Number  of  beams 

16 

48 

48 

16 

beams 

Nrg 

Number  of  range  gates 

64 

3500 

1909 

9900 

range  gates 

Ndop 

Number  of  doppler  bins 

24 

128 

64 

16 

doppler  bins 

Ntgts 

Number  of  targets  that  will  be 
pseudo-randomly  distributed 
in  Radar  data  cube 

30 

30 

30 

30 

targets 

Ncfar 

Number  of  CFAR  range  gates 

5 

10 

10 

20 

range  gates 

G 

CFAR  guard  cells 

4 

8 

8 

16 

range  gates 

mu 

Detection  sensitivity  factor 

100 

too 

100 

100 

ir 

Workload 

0.17 

150 

41 

18 

Mfiop 

12 


4.  Communication  Benchmark 


Many  signal  and  image  processing  applications  operate  on  multi-dimensional  data  in  multiple 
stages,  with  operations  focusing  on  a  different  dimension  in  each  subsequent  stage.  If  the  host 
platform  is  a  parallel  processor,  the  data  are  usually  distributed  across  the  nodes  to  exploit  data 
parallelism,  so  that  each  node  can  operate  in  parallel  on  its  portion  of  the  data  as  the  algorithm 
transitions  from  one  stage  to  the  next.  For  efficiency  reasons,  it  is  desirable  to  perform  a  corner- 
aim  of  the  data.  A  corner  turn  operation  is  defined  as  a  copy  of  the  object  with  a  change  in  the 
storage  order  of  the  underlying  data.  This  may  or  may  not  imply  transposition  of  the  computation 
object,  depending  on  the  implementation.  In  this  section,  we  describe  this  operation  in  more 
detail  and  describe  in  general  terms  an  abstracted,  high-level  application  that  requires  a  corner- 
turn  operation  (already  presented  in  an  earlier  document  [10],  some  of  which  is  repealed  here  for 
completeness). 

An  application  that  requires  a  corner  turn  works  first  on  the  rows  of  an  input  matrix,  and  then 
on  the  columns  of  the  intermediate  result  matrix.  Mathematically,  one  of  the  most  basic  examples 
of  such  an  operation  is  a  multiplication  of  three  matrices, 

Z  =  BhXA.  (II) 

where  B  and  A  are  application-dependent  matrices.  A'  is  the  matrix  of  input  data,  and  Z  is  the 
matrix  of  output  data.  An  example  situation  where  this  might  occur  would  be  a  filtering  operation 
followed  by  a  beamforming  operation.1  Suppose,  as  in  the  previous  description  [10],  that  we 
perform  the  operations  of  Equation  (I  1)  into  two  stages,  the  first  producing  Y  =  XA  and  the 
second  producing  Z  =  BHY .  Then  the  first  stage  is  an  operation  in  which  an  entire  row  of  A'  is 
desired,  and  the  second  is  an  operation  in  which  an  entire  column  of  Y  is  desired.  Thus,  the  two 
stages  suggest  different  optimal  data  layouts. 

The  idea  behind  the  corner-turn  operation  is  to  preserve  data  locality  in  the  dimension  being 
operated  on.  Whether  or  not  a  mathematical  transpose  is  performed  is  implementation-dependent. 
In  multi-dimensional  arrays  in  the  C  programming  language,  the  last  array  index  is  continuous  in 
memory.  In  order  to  perform  a  corner  turn  of  a  C  language  array,  a  transpose  is  required;  that  is, 
the  order  of  some  of  the  dimensions  must  be  reversed  (see  Figure  2). 

Now  consider  an  implementation  with  the  property  that  storage  order  is  independent  of  the 
order  of  the  indexes.  In  such  an  implementation,  it  would  be  possible  to  do  a  corner  turn  without 
requiring  a  transpose  of  the  computation  object.  The  vector,  signal,  and  image  processing  library 
(VSIPL,  [13])  is  an  example  of  a  library  where  this  is  possible:  the  stride  parameters  of  a  VS1PL 
view  allow  the  VSIPL  copy  operation  to  re-arrange  the  underlying  data  without  changing  the  math¬ 
ematical  properties  (see  Figure  3). 2  When  using  objects  with  this  property,  storage  order  may  be 
considered  to  be  a  mapping  issue,  whereas  when  using  standard  C  and  C++  arrays,  storage  order 
is  explicitly  embedded  in  the  application  program. 

The  discussion  above  does  not  consider  the  distribution  of  data  over  processors.  Distribution 
effectively  adds  a  level  of  memory  hierarchy  to  the  performance  of  a  corner  turn:  data  must  be 

'A  filtering  operation  can  he  represented  as  a  matrix-matrix  multiply,  but  would  usually  not  be  implemented  in 
such  a  fashion.  Thus,  the  use  of  matrix  multiplication  here  is  an  additional  level  of  abstraction  about  the  application. 

2ll  would  also  be  possible  to  implement  standard  C/C++  data  structures  with  this  property:  use  of  VSIPL  here  is 
purely  a  matter  of  convenience. 
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The  C  code  to  perform  a  comer  turn  of  two-dimensional  array  A  into  two- 
dimensional  array  B  is 

//  Notice  dimensions  of  B  are  the  reverse  of  those  of  A 
int  A [NX]  [NY]  ,  B [NY]  [NX]  ; 

for  (i  =  0;  i  <  NX;  i++ ) 

for  (j  =0;  j  <  NY;  j  +  +  ) 

B  [  j  ]  [i]  =  A  [i]  [j]  ; 

If  NX  and  NY  were  defined  at  compile  time  to  be  4  and  3,  respectively,  and 

"01  2 

3  4  5  . 

0  7  8 

9  10  11 

then  the  memory  area  underlying  A  is 
A={0123456789  10  11  } . 

Mathematically 

'  0  3  6  9 

B  =  1  4  7  10  =  A7. 

2  5  8  11  _ 

and  after  execution  of  the  above  code  the  memory  area  underlying  B  is 
B={0369147  10  25811}. 

Figure  2.  C  comer  turn  example .  In  matrix  B,  the  data  are  stored  in  corner- turned 
fashion  compared  to  matrix  A,  and  B  is  the  transpose  of  A. 


[4 


The  C  code  to  perform  a  corner  turn  from  VSIPL  matrix  view  A  into  VSIPL  matrix 
view  B  is 


//  Notice  dimensions  of  B  are  the  same  as  those  of  A: 
//  storage  order  is  different 
vsip_mview_i  *A  =  vsip_mcreate_i (NX , 

NY , 

VSIP_ROW , 

VS I P_MEM_NONE ) ; 

vsip_mview_i  *B  =  vsip_mcreate_i (NX, 

NY, 

VS I P_COL , 

VS IP  MEM  NONE) ; 


vs ip_mcopy_i__i (A,  B)  ; 


If  NX  and  NY  were  defined  at  compile  time  to  be  4  and  3,  respectively,  and 

"01  2 
,  3  4  5 

•4  =  6  7  8  - 

9  10  11 


then  the  block  underlying  A  is 
A  =  {  01234567891011}. 

After  execution  of  the  above  code,  B  is  mathematically  the  same  as  A,  and  the 
block  area  underlying  B  is 

B  -  {  0369147  10  25811}. 


Figure  3.  VSIPL  comer  turn  example .  Matrix  views  A  and  B  are  mathematically  the 
same  even  though  the  underlying  data  in  B  are  stored  in  corner- turn  fashion  compared  to 
matrix  A. 
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copied  to  a  new  processor  as  well  as  re-arranged  on  the  new  processor.  Frequently,  an  all-to-all 
communication  operation,  in  which  every  processor  communicates  with  every  other  processor,  is 
required  as  part  of  a  distributed  corner  turn. 

Parameters  for  two  corner  turn  sizes  are  given  in  Table  6.  These  corner  turn  sizes  are  based 
on  current  applications:  they  assume  32-bit  data,  either  integers  or  single-precision  floating-point 
data.  The  workload  is  twice  the  overall  matrix  size  since  the  data  is  being  copied  and  must  therefore 
pass  into  and  out  of  the  processor. 


Table  6. 


Corner  turn  parameters. 


Parameter 

Name 

Description 

Values 

Set  1 

Set  2 

M 

Matrix  rows 

50 

750 

N 

Matrix  columns 

5000 

5000 

k 

Element  size  (bytes) 

4 

4 

Matrix  size  (Mbyte) 

1 

15 

W 

Workload  (Mbyte) 

2 

30 

For  this  particular  kernel  benchmark,  the  idea  of  timing  throughput  and  latency  based  on  a 
single  processor  is  acceptable,  but  it  would  be  preferable  to  get  a  sense  of  the  throughput  and 
latency  possible  for  a  multi-processor  corner  turn  of  the  data.  In  the  most  frequent  occurrences  of  a 
distributed  corner  turn  in  signal  processing,  the  source  and  destination  processor  groups  are  either 
identical  or  are  completely  disjoint.  The  derived  statistics  of  most  interest  for  multi-processor 
transfers  are  the  stability  of  the  throughput  over  the  number  of  processors  used,  and  the  efficiency 
of  the  transfer  versus  the  theoretical  peak  bandwidth. 
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5.  Information  and  Knowledge  Processing  Benchmarks 


5.1  Pattern  Matching 

The  pattern  matching  kernel  is  extracted  from  the  feature-aided  tracking  portion  of  the  inte¬ 
grated  radar-tracker  application  [2],  Fundamentally,  this  step  entails  overlaying  two  length-yV  vec¬ 
tors  a  and  t  and  computing  a  metric  that  quantifies  the  degree  to  which  these  two  vectors  match. 
In  general,  the  vector  /  is  chosen  from  a  set  of  reference  vectors  referred  to  as  the  template  library. 
The  metric  used  for  matching  is  the  weighted  MSE  (mean  square  error)  e. 


s 


^2  (wk  *  (a*.  -  tk)2) 


(12) 


where  k  =  1,2 . N  is  the  vector  of  weights.  The  optimal  weights  for  the  feature-aided 

tracker  have  been  computed  empirically.  In  the  kernel  benchmark,  we  provide  a  generic  weighting 
vector. 

The  calculation  done  in  equation  ( 12)  is  performed  on  data  that  has  been  converted  to  decibels 
(the  “dB  domain”).  This  is  done  because  the  raw  power  output  from  a  signal  processing  system  can 
vary  by  many  orders  of  magnitude.  However,  conversion  of  patterns  between  the  power  domain 
and  the  dB  domain  is  performed  during  the  course  of  the  benchmark:  this  requires  the  use  of 
a  number  of  logarithm  and  exponentiation  functions.  The  operation  count  for  these  functions  is 
implementation-dependent,  and  so  the  workload  we  give  has  three  components:  a  count  of  the 
number  of  calls  to  the  exponent  function,  a  count  of  the  number  of  calls  to  the  logarithm  function, 
and  a  count  of  the  operations  in  the  rest  of  the  benchmark. 

Before  the  two  profiles  can  be  overlaid,  they  may  need  to  be  shifted  in  range  to  the  left  or  right, 
and  the  magnitude  of  the  profiles  may  need  to  be  scaled  to  match.  The  optimal  shift  and  gain  values 
can  be  found  through  brute  force  by  computing  the  MSE  for  each  combination  of  shift  and  gain 
values,  then  taking  the  minimum  MSE.  However,  by  noting  that  the  MSE  is  a  parabolic  function 
of  the  shift  and  gain,  we  can  find  the  optimum  shift  and  gain  values  at  the  global  minimum  by 
first  finding  the  optimal  shift,  then  finding  the  optimal  gain  value.  A  summary  of  this  procedure  is 
shown  in  Figure  4. 

The  parameters  of  interest  for  the  pattern  matching  benchmark  are  the  length  of  the  pattern 
vectors,  the  size  of  the  template  pattern  library,  and  the  number  of  shift  and  scale  operations  per¬ 
formed.  These  parameters  are  given  for  two  data  set  sizes  in  Table  7. 

5.2  Database  Operations 

We  consider  measuring  the  performance  of  database  operations  in  the  context  of  a  tracking 
application  that  stores  track  information  in  a  database.  Tracks  are  indexed  using  their  location 
(spatial  coordinates).  The  tracker  operates  in  discrete  time  intervals  called  cycles.  During  each 
cycle,  the  tracker  receives  a  set  of  target  reports  from  a  radar.  It  asks  the  database  to  search  for  all 
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for  each  of  K  patterns 

for  each  of  Sr  shift  values 

calculate  MSE  value  with  shifted  pattern 
Choose  shift  value  with  smallest  MSE 
for  each  of  Sm  magnitude  values 

calculate  MSE  value  with  scaled  pattern 
Choose  gain  value  with  the  smallest  MSE 


Figure  4.  Outline  of  the  pattern  match  kernel. 


Table  7. 

Pattern  matching  parameters. 


Parameter 

Name 

Description 

Values 

Set  1 

Set  2 

N 

Pattern  length 

64 

128 

K 

Number  of  patterns 

72 

256 

Sr 

Number  of  shifts 

21 

43 

sm 

Number  of  magnitude  scalings 

21 

21 

vr, 

Workload:  log10  function  calls 

4.61  x  1CF 

32.8  x  1():{ 

ir2 

Workload:  1CF  function  calls 

4.61  x  103 

32.8  x  1():{ 

Workload:  Other  floating-point  ops  (flops) 

1.20  x  10(i 

13.6  x  1()6 

tracks  that  could  be  associated  with  each  target  report,  based  on  location.  The  tracker  may  direct 
the  database  to  insert  new  tracks  based  on  target  reports  that  are  not  associated  with  any  tracks, 
and  to  delete  specific  tracks. 

The  database  interface  receives  a  stream  of  instructions  from  the  tracker  consisting  of  the 
search,  insert,  and  delete  operations  to  be  performed.  Its  output  is  a  set  of  record  identifiers  which 
are  presumably  used  to  look  up  the  actual  records  in  memory.  As  the  actual  database  does  not 
exist,  the  numbers  are  essentially  random  32-bit  integers.  Our  goal  is  to  measure  the  performance 
of  the  search,  insert,  and  delete  operations,  without  ever  altering  the  contents  of  any  particular 
record.  The  major  motivation  for  this  is  to  avoid  generating  the  large  amount  of  data  necessary  for 
the  database.  A  typical  record  from  a  feature-aided  track  application  is  on  the  order  of  650  bytes 
per  record  (see  document  PCA-IRT-4,  [2]),  and  test  cases  of  interest  may  require  up  to  100,000 
such  records.  Thus,  in  the  benchmark,  we  do  not  actually  generate  and  maintain  the  contents  of  the 
database  itself,  only  the  indexing  structures.  Therefore,  the  structures  used  only  store  the  values 
we  need:  :r  and  tj  coordinates,  and  a  record  identifier,  which  is  an  32-bit  integer  index  of  or  pointer 
to  a  data  record. 

For  an  application  of  this  type,  the  three  database  operations  needed  are: 

search  Look  up  and  retrieve  all  items  whose  characteristics  fall  into  a  given  range.  In  this  case, 
a  search  is  done  for  all  targets  within  a  specified  range  of  a  particular  (x.  tj )  coordinate  pair, 
where  .r  and  y  are  floating-point  numbers.  Given  a  set  of  sector  bounds  {.r/u,„.  .ry\/„x.  t nrm- 
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Table  8. 


Tracking  parameters. 


Parameter 

Values 

Description 

Set  1 

Set  2 

Cycles  to  run 

100 

100 

Total  existing  targets  (P  -I-  U ) 

500 

102,400 

Number  of  placed  targets  (P) 

450 

92,160 

X  total  area,  M  (km) 

5 

32 

Y  total  area,  N  (km) 

5 

32 

X  search  area,  Ax  (km) 

2 

2 

Y  search  area,  Ay  (km) 

2 

2 

Overall  target  density,  d  (targets/km2) 

20 

100 

Search  operations  per  cycle,  ns 

400 

100 

Matches  found  per  search  k 

80 

400 

Insert  operations  per  cycle,  m 

20 

300 

Delete  operations  per  cycle  rid 

20 

300 

Workload  per  cycle  (transactions) 

440 

700 

this  search  can  be  expressed  in  a  fashion  approximating  the  structured  query  language  (SQL) 
as 


select  *  from  TrackDatabase 

where  (,r  >  xA,,„  AND  x  <  x.Mnx  AND  y  >  yKUv  AND  y  <  y^aT). 

insert  Add  a  new  item  to  the  database.  This  can  be  expressed  in  an  SQL-like  fashion  as 
insert  into  TrackDatabase  values(?d,  xu,  yu). 

delete  Remove  an  item  from  the  database,  expressed  in  an  SQL-like  fashion  as 
delete  from  TrackDatabase  where  (x  =  xu  AND  y  =  yu). 

Database  workloads  are  provided  based  on  two  scenarios:  a  kinematic  tracking  scenario  similar 
to  the  parameters  proposed  for  the  integrated  radar-tracker  benchmark,  and  a  multi-hypothesis 
tracking  scenario  in  which  the  database  is  allowed  to  become  much  larger.  For  each  scenario,  the 
frequency  of  each  operation  (search,  insert,  delete)  is  specified.  The  parameters  that  define  these 
two  scenarios  are  given  in  Table  8. 

5.2.1  Test  Data 

Test  data  for  the  database  is  drawn  from  a  tracker  scenario  with  stationary  targets  which  will 
appear  and  disappear  from  the  database.  The  targets  for  the  scenario  will  be  distributed  roughly 
evenly  on  a  grid  of  size  M  x  N  km2.  These  targets  will  be  divided  into  a  set  of  placed  targets,  P, 
and  a  set  of  unplaced  targets,  U .  Targets  in  set  P  will  have  corresponding  records  in  the  database 
(they  have  previously  “appeared”)  and  targets  in  set  U  will  not  (they  have  “disappeared”).  The  size 


19 


of  set  P  is  defined  to  be  sufficient  that  searches  in  an  area  of  size  A.r  x  Ay  will  expect  to  find  k 
targets  on  average;  that  is,  the  target  density  is  d  =  k/{AxAy)  targets/km2.  The  size  of  set  U  is 
chosen  to  allow  sufficient  insertions  and  deletions  with  additional  targets  remaining  to  allow  some 
measure  of  differences  between  searches.  We  have  chosen  to  make  P  90%  of  the  total  targets  and 
U  the  remaining  10%. 

The  database  generator  creates  a  sequence  of  commands  for  the  database.  To  do  this,  the 
generator  must  keep  track  of  the  sets  P  and  U .  For  each  cycle,  the  generator  must  choose  ns 
uniformly  random  (,r ,y)  pairs,  nd  targets  from  P,  and  ni  targets  from  U.  The  ns  pairs  will  be 
passed  to  the  database,  which  will  search  the  records  and  return  those  within  A.r  and  Ay  of  the 
random  pairs.  The  nd  targets  from  P  will  be  passed  to  the  database  to  have  their  records  deleted 
(they  will  “disappear”).  The  ni  targets  from  U  will  be  passed  to  the  database  and  corresponding 
records  will  be  inserted  into  the  database  (thus  will  “appear”).  The  time  reported  for  each  cycle  is 
the  total  time  to  execute  the  search,  insert,  and  delete  commands.  The  command  generation  occurs 
before  any  of  the  actual  benchmarking  and  therefore  is  not  timed. 

5.2.2  Workload 

For  a  workload  value  for  each  scenario,  we  count  each  transaction  (search,  insert,  delete)  as  an 
operation  to  be  performed.  This  workload  value,  given  in  Table  8,  can  be  used  to  compute  through¬ 
put  for  the  database  kernel  (in  transactions  per  second)  and  compare  among  different  architectures. 
However,  an  efficiency  for  the  database  kernel  benchmark  is  not  defined.  Peak  performance  for  the 
database  kernel  benchmark  would  be  calculated  from  the  rate  at  which  the  PCA  chip  can  perform 
each  database  operation,  which  in  turn  is  related  to  the  memory  hierarchy  of  the  entire  system. 

5.3  Graph  Optimization  via  Genetic  Algorithm 

Genetic  algorithms  [4,  6,  14]  have  become  a  viable  solution  to  strategically  perform  a  global 
search  by  means  of  many  local  searches.  The  basis  of  the  genetic  algorithm  methods  is  derived 
from  the  mechanisms  of  evolution  and  natural  genetics.  The  genetic  algorithm  that  is  being  used 
as  one  of  these  kernel  benchmarks  is  a  fairly  simple  version.  Many  modifications  are  possible  that 
can  enhance  the  performance  for  a  given  application,  and  some  small  enhancements  have  been 
made  to  enhance  the  performance  of  this  benchmark. 

A  genetic  algorithm  works  by  building  a  population  of  chromosomes  which  is  a  set  of  possible 
solutions  to  the  optimization  problem.  Within  a  generation  of  a  population,  the  chromosomes 
are  randomly  altered  in  hopes  of  creating  new  chromosomes  that  have  better  evaluation  scores. 
The  next  generation  population  of  chromosomes  is  randomly  selected  from  the  current  generation 
with  selection  probability  based  on  the  evaluation  score  of  each  chromosome.  The  simple  genetic 
algorithm  follows  the  structure  depicted  in  Figure  5.  Each  of  these  operations  will  be  described  in 
the  following  subsections. 


5.3.1  Initialization 

Initialization  involves  setting  the  parameters  for  the  algorithm,  creating  the  scores  for  the  sim¬ 
ulation,  and  creating  the  first  generation  of  chromosomes.  In  this  benchmark,  seven  parameters  are 
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Simple  Genetic  Algorithm  () 

{ 

Initial izat ion ; 

Evaluation; 

while  termination  criterion  has  not  been  reached 

{ 

Select  ion_and_Reproduct  ion  ; 

Crossover ; 

Mutation ; 

Evaluation; 

} 

} 


Figure  5.  Structure  of  a  simple  genetic  algorithm. 


set: 


•  the  genes  value  (Genes)  is  the  number  of  variable  slots  on  a  chromosome; 

•  the  codes  value  (Codes)  is  the  number  of  possible  values  for  each  gene; 

•  the  population  size  (PopSize)  is  the  number  of  chromosomes  in  each  generation; 

•  crossover  probability  (CrossoverProb)  is  the  probability  that  a  pair  of  chromosomes  will 
be  crossed; 

•  mutation  probability  (Mutat  ionProb)  is  the  probability  that  a  gene  on  a  chromosome  will 
be  mutated  randomly; 

•  the  maximum  number  of  generations  (MaxGenerat  ions)  is  a  termination  criterion  which 
sets  the  maximum  number  of  chromosome  populations  that  will  be  generated  before  the  top 
scoring  chromosome  will  be  returned  as  the  search  answer;  and 

•  the  generations  with  no  change  in  highest-scoring  (elite)  chromosome  (GensNoChange)  is 
the  second  termination  criterion  which  is  the  number  of  generations  that  may  pass  with  no 
change  in  the  elite  chromosome  before  that  elite  chromosome  will  be  returned  as  the  search 
answer. 

The  scores  matrix  for  the  simulation,  which  is  generated  in  the  GenAlgGen  script,  is  the  set 
of  scores  for  which  the  best  solution  is  to  be  found.  The  attempted  optimization  is  to  find  the  code 
for  each  gene  in  the  solution  chromosome  that  maximizes  the  average  score  for  the  chromosome. 
Finally,  the  first  generation  of  chromosomes  are  generated  randomly. 

5.3.2  Evaluation 

Each  of  the  chromosomes  in  a  generation  must  be  evaluated  for  the  selection  process.  This  is 
accomplished  by  looking  up  the  score  of  each  gene  in  the  chromosome,  adding  the  scores  up,  and 
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averaging  the  score  for  the  chromosome.  As  part  of  the  evaluation  process,  the  elite  chromosome 
of  the  generation  is  determined. 


5.3.3  Selection  and  Reproduction 

Chromosomes  for  the  next  generation  are  selected  using  the  roulette  wheel  selection  scheme  [14] 
to  implement  proportionate  random  selection.  Each  chromosome  has  a  probability  of  being  cho¬ 
sen  equal  to  its  score  divided  by  the  sum  of  the  scores  of  all  of  the  generation’s  chromosomes. 
In  order  to  avoid  losing  ground  in  finding  the  highest-scoring  chromosome,  elitism  [14]  has  been 
implemented  in  this  benchmark.  Elitism  reserves  two  slots  in  the  next  generation  for  the  highest 
scoring  chromosome  of  the  current  generation,  without  allowing  that  chromosome  to  be  crossed 
over  in  the  next  generation.  In  one  of  those  slots,  the  elite  chromosome  will  also  not  be  subject  to 
mutation  in  the  next  generation. 

5.3.4  Crossover 

In  the  crossover  phase,  all  of  the  chromosomes  (except  for  the  elite  chromosome)  are  paired  up, 
and  with  a  probability  CrossoverProb,  they  are  crossed  over.  The  crossover  is  accomplished 
by  randomly  choosing  a  site  along  the  length  of  the  chromosome,  and  exchanging  the  genes  of  the 
two  chromosomes  for  each  gene  past  this  crossover  site. 

5.3.5  Mutation 

After  the  crossover,  for  each  of  the  genes  of  the  chromosomes  (except  for  the  elite  chromo¬ 
some),  the  gene  will  be  mutated  to  any  one  of  the  codes  with  a  probability  of  MutationProb. 
With  the  crossover  and  mutations  completed,  the  chromosomes  are  once  again  evaluated  for  an¬ 
other  round  of  selection  and  reproduction. 

5.3.6  Termination 

The  loop  of  chromosome  generations  is  terminated  when  certain  conditions  are  met.  When  the 
termination  criteria  are  met,  the  elite  chromosome  is  returned  as  the  best  solution  found  so  far.  For 
this  benchmark,  there  are  two  criteria:  if  the  numberof  generation  has  reached  a  maximum  number, 
MaxGenerations,  or  if  the  elite  solution  has  not  changed  for  a  certain  number  of  generations, 
GensNoChange. 

5.3.7  Implementation  Notes 

Genetic  algorithms  are  being  used  around  the  world  for  an  enormous  variety  of  applications. 
However,  this  benchmark  of  genetic  algorithms  has  been  designed  with  two  specific  purposes  in 
mind:  matching  computational  tasks  with  processing  units  in  a  general  independent  task  envi¬ 
ronment  and  in  signal  processing  pipeline  tasks.  For  the  general  independent  task  environment, 
the  genes  of  the  chromosomes  are  the  computational  tasks  which  have  arrived  in  the  order  of  their 
gene’s  number,  and  the  codes  are  the  possible  processing  units  upon  which  the  computational  tasks 
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Table  9. 


Parameter  sets  for  the  Genetic  Algorithm  Kernel  Benchmark. 


Name 

Description 

Set  1 

Set  2 

Set  3 

Set  4 

Units 

Codes 

Number  of  code  types  for  a 
gene 

4 

8 

100 

1000 

codes 

Genes 

Number  of  genes  on  a  chro¬ 
mosome 

8 

96 

5 

10 

genes 

PopSize 

Number  of  chromosomes  in  a 
generation 

50 

200 

100 

400 

chromosomes 

CrossoverProb 

Probability  of  crossing  over  a 
pair  of  chromosomes 

0.01 

0.002 

0.02 

0.03 

MutationProb 

Probability  of  mutating  a 
chromosome 

0.60 

0.60 

0.60 

0.30 

MaxGenerations 

Maximum  number  of  genera¬ 
tions 

500 

2000 

500 

5000 

generations 

GensNoChange 

Maximum  number  of  genera¬ 
tion  with  no  change  in  elite 
chromosome 

50 

150 

50 

500 

generations 

Ops  per  generation 

1750 

77400 

2300 

17200 

operations 

Random  numbers  per  genera¬ 
tion 

898 

38798 

1  198 

8798 

numbers 

Ops  for  random  number  gen. 

9878 

426778 

13178 

96778 

operations 

Total  Ops 

1 1628 

504178 

15478 

113978 

operations 

can  be  executed.  Similarly,  for  the  signal  processing  pipeline  tasks,  the  genes  of  the  chromosomes 
are  the  pipelined  tasks  that  constitute  a  signal  processing  chain,  and  the  codes  are  the  computa¬ 
tional  unit  mappings  upon  which  the  pipeline  stage  tasks  can  be  run.  The  scores  for  each  code  in 
a  given  gene  position  represents  a  goodness  factor  (for  computational  efficiency,  execution  time, 
or  some  other  measure)  ranging  from  zero  to  one  (one  being  best).  The  goal  then  is  to  find  the 
mapping  of  tasks  onto  processors  that  yields  the  best  score,  in  this  case  a  perfect  score  of  one. 

The  random  number  generator  for  the  genetic  algorithm  is  assumed  to  be  the  one  defined  in 
VS1PL  [13,  p.245].  This  generator  is  used  because  the  implementation  is  available  and  the  work¬ 
load  is  easily  calculable  (based  on  the  discussion  in  the  standard,  we  assume  I  I  ops  per  random 
number  generated).  Use  of  this  random  number  generator  is  not  required.  However,  if  a  different 
random  number  generator  is  used,  the  workload  given  in  Table  9  should  be  altered  accordingly. 

For  this  kernel  benchmark,  four  parameter  sets  have  been  included.  These  sets  are  shown  in 
Table  9.  Sets  1  and  2  are  sample  parameters  for  the  general  independent  task  scenario  while  sets 
3  and  4  are  sample  parameters  for  the  digital  signal  processing  pipeline  task  scenario.  For  this 
kernel,  we  define  the  workload  in  operations  (rather  than  floating-point  operations)  per  generation. 
The  workload  is  related  to  the  number  of  genes  and  the  population  size  per  generation. 
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Parts  of  the  genetic  algorithm  code  are  embarrassingly  parallel,  including  the  crossover  and 
mutation  sections.  Most  of  the  evaluation  section  is  also  embarrassingly  parallel,  except  for  the 
elite  chromosome  determination  portion.  However,  the  selection  and  reproduction  section  can¬ 
not  be  conducted  in  a  parallel  manner  since  a  view  of  the  entire  population  is  necessary.  More 
discussion  on  parallelizing  and  distributing  genetic  algorithms  can  be  found  in  [5]. 
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6.  Further  Kernel  Benchmarks 


These  are  benchmarks  that  might  be  considered  for  a  second  set  of  kernel  benchmarks,  but  are  not 
included  with  this  first  set. 

Image  encoding.  Compress  a  synthetic  aperture  radar  (SAR)  image  for  storage  or  transmission. 
The  dynamic  range  of  SAR  imagery  is  such  that  it  is  not  well  represented  using  conventional 
image  processing  standards  such  as  that  defined  by  the  joint  photographic  experts  group  (JPEG) 
for  image  compression.  The  JPEG  2000  standard  addresses  the  problems  that  JPEG  presents  for 
SAR  images.  For  more  details  on  JPEG  2000,  see  Taubman  and  Marcellin  [15].  Baxter  and  Seibert 
have  performed  an  analysis  of  the  desirable  features  of  an  encoding  algorithm  for  SAR  images  [  I  ]. 
The  major  features  of  the  encoding  algorithm  as  they  describe  it  are: 

•  wavelet  packet  transforms  with  a  Gabor-like  tree  structure  and  smooth  biorthogonal  wavelet 
filters, 

•  trellis-coded  quantization,  and 

•  a  bit-allocation  procedure  based  on  minimizing  distortion  (perceptual  distortion,  based  on 
the  human  visual  system)  and  rate. 

Some  of  these  features  (particularly  trellis-coded  quantization)  are  in  “part  2”  of  the  JPEG  2000 
standard,  and  are  therefore  not  yet  present  in  most  publicly  available  implementations  of  the  stan¬ 
dard. 

Secure  network  protocol.  Transmit  a  message  of  a  given  size  with  authentication  and  encryption, 
based  on  IPSec  or  a  similar  protocol. 

Synchronization.  Mark  a  value  on  a  different  processor  or  in  a  remote  memory  for  exclusive 
access  (lock/unlock  operations). 

Image  processing:  morphological  operations.  Take  an  image  and  perform  an  “opening”  or 
“closing”  operation  on  it.  These  are  integer  convolution  operations. 

Image  processing:  edge  detection.  Perform  edge  detection  in  both  the  s  and  y  dimensions  of  a 
two-dimensional  image. 

Giga-updates  per  second.  Read,  then  write  a  sequence  of  random  memory  locations  in  a  large 

memory  space.  A  description  of  the  benchmark  is  at 

<http : / / iram . cs . berkeley . edu/ ~brg/di s/gups/ >. 

Incomplete  Gamma  function.  Calculate  values  of  the  incomplete  gamma  function. 
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APPENDIX  A 
Revisions 


This  is  the  first  public  revision  of  the  document,  dated  13  June  2005.  It  introduces  the  following 
changes  from  the  original  public  version: 

•  references  to  kernel  benchmark  code  and  its  availability  dates  were  deleted; 

•  the  database  kernel  benchmark  description  was  revised  to  remove  implementation  details; 

•  the  CFAR  workload  and  description  were  revised  to  reflect  real  (as  opposed  to  complex) 
data; 

•  an  outline  of  the  calculations  performed  in  the  pattern  match  kernel  was  added,  and  some 
errors  in  the  operation  count  were  corrected; 

•  the  QR  factorization  section  was  added; 

•  the  SVD  section  was  revised  to  reflect  a  reduced  SVD; 

•  the  corner  turn  workload  was  revised  to  clarify  the  data  size  and  why  the  workload  is  twice 
the  matrix  data  size; 

•  the  FIR  workload  was  slightly  changed  and  some  notes  on  the  operation  counts  were  added; 
and 

•  the  “Revisions”  chapter  that  you  are  reading  now  was  re-ordered  so  that  the  most  recent 
changes  were  at  the  start. 

The  first  public  version  of  the  document  was  issued  23  January  2004.  It  introduced  the  follow¬ 
ing  changes  from  the  program-private  version: 

•  an  error  in  the  CFAR  workload  table  (Table  5)  was  corrected; 

•  a  description  of  the  reduced  SVD  algorithm,  including  workload  estimates,  was  added; 

•  the  description  of  the  database  kernel  benchmark  was  heavily  revised  and  the  workload  was 
changed; 

•  a  description  of  the  random  number  generator  used  for  the  genetic  algorithm  was  added  and 
its  workload  was  updated  in  a  corresponding  way; 

•  the  pattern  match  benchmark  description  and  workload  were  updated  to  reflect  a  more  effi¬ 
cient  implementation  and  to  introduce  a  second  data  set;  and 

•  the  discussion  of  metrics  (in  particular,  of  efficiency  and  program  stability)  was  updated  to 
reflect  that  efficiency  is  not  always  easily  calculable  for  the  different  kernels. 

The  original  version  of  this  document  was  distributed  to  PCA  program  participants  (only)  at 
sponsor  direction,  July  31,  2002. 
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